GH-3505: Optimize ByteStreamSplitValuesReader page transposition#3506
Open

iemejia wants to merge 1 commit into apache:master
Conversation
Force-pushed from fa627d5 to 88a3b0e
iemejia added a commit to iemejia/parquet-java that referenced this pull request on Apr 19, 2026:

… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList". This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a working build, or unrunnable at all from a default build).
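The pom fixes described in the commit message would look roughly like the following sketch. This is an illustration of the standard JMH-plus-shade setup, not the actual diff; the `${jmh.version}` property and plugin placement are assumptions.

```xml
<!-- Illustrative sketch, not the actual patch: wire up JMH's annotation
     processor and make the shade plugin concatenate JMH's generated
     resources instead of letting a single jar's copy win. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <annotationProcessorPaths>
      <path>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>${jmh.version}</version>
      </path>
    </annotationProcessorPaths>
  </configuration>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/BenchmarkList</resource>
      </transformer>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/CompilerHints</resource>
      </transformer>
    </transformers>
  </configuration>
</plugin>
```

Without the AppendingTransformer entries, shading discards or overwrites `META-INF/BenchmarkList`, which is exactly the "Unable to find the resource" failure quoted above.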
Rationale for this change

`ByteStreamSplitValuesReader` is the symmetric reader for `BYTE_STREAM_SPLIT`-encoded `FLOAT`, `DOUBLE`, `INT32`, and `INT64` columns. On `initFromPage` it eagerly transposes the entire page from stream-split layout (`elementSizeInBytes` separate streams of `valuesCount` bytes each) back to interleaved layout (`valuesCount` elements of `elementSizeInBytes` bytes each). The current loop has two issues on the hot path:

1. It performs one `ByteBuffer.get(int)` call per byte, which incurs per-call bounds checks and virtual dispatch through `HeapByteBuffer`/`DirectByteBuffer` for every single byte of the page.
2. The stream base offset (`stream * valuesCount`) is recomputed on every iteration even though it depends only on the outer loop.

For a 100k-value `FLOAT` page that is 400k `ByteBuffer.get(int)` calls; for `DOUBLE`/`LONG` it is 800k.

What changes are included in this PR?
Rewrite `decodeData` in three steps:

1. Drop down to a `byte[]` view of the encoded buffer. When `encoded.hasArray()` is true (the typical case), use the backing array directly with the correct base offset; otherwise copy once with a single `get(byte[])` call. This eliminates the per-byte `ByteBuffer.get(int)` bounds check and virtual dispatch.
2. Specialize loops for the common element sizes (4 and 8). Hoist all `stream * valuesCount` offsets out of the inner loop into local ints (`s0`..`s3` for floats/ints, `s0`..`s7` for doubles/longs) and write each output slot exactly once in a single sequential pass. Reads come from `elementSizeInBytes` concurrent sequential streams, which modern hardware prefetchers handle well.
3. Generic fallback for arbitrary element sizes (`FIXED_LEN_BYTE_ARRAY` of any width) keeps the existing behaviour.

Benchmark
New `ByteStreamSplitDecodingBenchmark` (100k values per invocation, JDK 18, JMH `-wi 5 -i 10 -f 3`, 30 samples per row):

| Type   | Before (ops/s) | After (ops/s) | Improvement    |
|--------|----------------|---------------|----------------|
| Float  | 47,798,981     | 162,294,904   | +240% (3.40x)  |
| Double | 26,320,043     | 66,002,524    | +151% (2.51x)  |
| Int    | 47,072,832     | 162,177,747   | +245% (3.45x)  |
| Long   | 26,795,544     | 65,999,343    | +146% (2.46x)  |

Decoded output is byte-identical to before; per-op heap allocation is unchanged (the only allocation is the per-page decode buffer, plus the boxing of returned primitives by the benchmark).
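The before/after transposition strategies described above can be sketched as follows. This is a minimal, self-contained illustration of the technique, not the actual patch; the method names and the size-4 specialization shown are assumptions for the sake of the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch of the two transposition strategies (not the patch itself).
public class BssTransposeSketch {

  // Before: one virtual, bounds-checked ByteBuffer.get(int) per byte of the page.
  static byte[] decodeNaive(ByteBuffer encoded, int valuesCount, int elementSize) {
    byte[] out = new byte[valuesCount * elementSize];
    for (int stream = 0; stream < elementSize; stream++) {
      for (int i = 0; i < valuesCount; i++) {
        // stream * valuesCount is recomputed on every inner iteration
        out[i * elementSize + stream] = encoded.get(stream * valuesCount + i);
      }
    }
    return out;
  }

  // After (size-4 specialization): drop to a byte[] once, hoist the four
  // stream offsets, and fill each output element in a single sequential pass.
  static byte[] decodeSpecialized4(ByteBuffer encoded, int valuesCount) {
    byte[] in;
    int base;
    if (encoded.hasArray()) {
      in = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      in = new byte[valuesCount * 4];
      encoded.duplicate().get(in); // single bulk copy for direct buffers
      base = 0;
    }
    byte[] out = new byte[valuesCount * 4];
    int s0 = base;
    int s1 = base + valuesCount;
    int s2 = base + 2 * valuesCount;
    int s3 = base + 3 * valuesCount;
    for (int i = 0, o = 0; i < valuesCount; i++, o += 4) {
      out[o] = in[s0 + i];
      out[o + 1] = in[s1 + i];
      out[o + 2] = in[s2 + i];
      out[o + 3] = in[s3 + i];
    }
    return out;
  }

  public static void main(String[] args) {
    // Three 4-byte values in stream-split layout: all byte-0s, then byte-1s, ...
    byte[] split = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    byte[] a = decodeNaive(ByteBuffer.wrap(split), 3, 4);
    byte[] b = decodeSpecialized4(ByteBuffer.wrap(split), 3);
    if (!Arrays.equals(a, b)) throw new AssertionError("outputs differ");
    // First interleaved element is {1, 4, 7, 10}.
    if (a[0] != 1 || a[1] != 4 || a[2] != 7 || a[3] != 10) throw new AssertionError("layout");
  }
}
```

The `encoded.hasArray()` branch is what removes the per-byte virtual dispatch: once the data is a plain `byte[]`, the JIT sees simple array loads with hoisted, loop-invariant base offsets.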
Are these changes tested?

Yes. All 573 `parquet-column` tests pass; 51 BSS-specific tests pass (`mvn test -pl parquet-column -Dtest='*ByteStreamSplit*'`). No new test was added because the decoded bytes are unchanged (covered by existing round-trip and `ByteStreamSplitValuesReaderTest` tests).

Are there any user-facing changes?
No. Only an internal reader optimization. No public API, file format, or configuration change.
Closes #3505
Symmetric companion to #3504 (writer-side BSS optimization). Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504.
How to reproduce the benchmarks
The JMH benchmarks cited above are being added to `parquet-benchmarks` in #3512. Once that lands, compare runs against `master` (baseline) and this branch (optimized).